Abstract

We argue that relationships between Web pages are functions of the user's intent. We identify a
class of Web tasks - information-gathering - that can be facilitated by a search engine that provides
links to pages which are related to the page the user is currently viewing. We define three kinds of
intentional relationships that correspond to whether the user is a) seeking sources of information,
b) reading pages which provide information, or c) surfing through pages as part of an extended
information-gathering process. We show that these three relationships can be productively mined
using a combination of textual and link information and provide three scoring mechanisms that
correspond to them: SeekRel, FactRel and SurfRel. We build a set of capacitated subnetworks - each
corresponding to a particular keyword - that mirror the interconnection structure of the World Wide
Web. The scores are obtained by computing flows on these subnetworks. The capacities of the links
are derived from the hub and authority values of the nodes they connect, following the work of
Kleinberg (1998) on assigning authority to pages in hyperlinked environments. We evaluated our
scoring mechanisms by running experiments on four data sets taken from the Web. We present user
evaluations of the relevance of the top results returned by our scoring mechanisms and compare those
to the top results returned by Google's Similar Pages feature, and the Companion algorithm proposed
by Dean and Henzinger (1999).

1 Introduction

The tremendous success of collaborative filtering-based recommendation systems (see e.g. [20]) in online
retail settings (e.g. Netflix) has demonstrated that users welcome guidance while looking for books to
buy or films to rent, i.e. where they are looking for a product which satisfies a general specification
rather than for a specific product. In the enterprise search space, the increasing importance of faceted
search - essentially a method of providing recommendations to satisfy a user's search needs by creating
multiple taxonomies - pioneered by Endeca [12] under the name "guided navigation", shows that
businesses are recognizing that they can improve profitability by helping their employees
and customers browse through large databases, providing search results related to the ones users
express a preference for.

But search engines for the World Wide Web have been largely unsuccessful in providing accurate 
and helpful recommendations to their users. Research has found that most Web users are not using 
advanced features provided by search engines; it has been shown that they barely understand what 
these features do [29]. Further, it has been seen that Web users search less and browse more [2]. And,
in fact, the process of developing expertise in using the Web coincides with an increase in browsing 
and decrease in searching [8]. 

Despite this bleak scenario it is our contention that search engines have the resources to effectively
provide users with related pages that can help in information-gathering tasks. Further, we believe that
a search engine which can do this will increase its value for its users, with consequent increases in
revenue. There is a growing understanding in the search domain that user intent is crucial to the
search process [27] [TS]. Extending this understanding to the domain of relationships between pages
can make information-gathering tasks easier.

In other words, to know the relationship between two pages, we must first know what function these 
two pages serve for the users who visit them. It is only by characterizing the task the user is engaged 
in that we can offer related pages and hope that these will actually facilitate the task. In this paper 
we provide a suite of scoring mechanisms that relate Web pages: SeekRel, FactRel and SurfRel. The 
purpose of these scoring mechanisms is to help identify pages which may be related to the current page 
depending on whether the user is reading pages which provide information, seeking links to sources of 
information, or surfing through pages as part of an extended information-gathering process. 

A brief overview of our method is as follows: We compute our scores using flow calculations in a set 
of subnetworks of the Web. Each subnetwork corresponds to a single keyword. The set of subnetworks 
used to score a pair of pages is decided by finding the keywords relevant to the pair of pages being 
scored. Then the edges of these networks are capacitated using the hub values [18] of their originating
pages. Finally flow is sent along these edges towards special nodes we call witnesses. The amount of 
flow that can be routed is used as a measure of the relationship. 

Organization. In Section 2 we focus on information-gathering tasks and identify how our recommendations
can facilitate them. We survey previous work on relating Web pages in Section 3. Our
specific scoring mechanisms are described in Sections 4.1 and 4.2. A comparison of our results for a
small toy network with the results produced by two recent proposals, PageSim [21] and SimRank [16],
is presented in Section 4.3. In Section 5 we present experiments conducted on real data taken from
the Web. We discuss the results of user surveys that compared the top results produced by our scoring
mechanisms with the results given by the Companion algorithm of Dean and Henzinger [10] and
Google's Similar Pages feature. Finally, in the concluding section we discuss the merits of our scheme
in comparison to these two proposals and argue that our scheme is a better candidate for inclusion
in a search engine than the Companion algorithm.

2 Information-gathering on the Web 

Broder [6] classified the tasks that Web users undertake into three general categories:
navigational (finding specific pages), informational (seeking facts) and transactional (performing some
interactive set of tasks). Kellar [17] further classifies informational tasks into information-seeking,
information exchange and information maintenance tasks. It is in the first of these classes,
information-seeking, that search engines make their major contributions. Kellar differentiates between
three types of information-seeking tasks: fact-finding (e.g. directions to a friend's house or exam dates),
information-gathering (e.g. tectonic movements, Mac laptops) and browsing (e.g. news, a friend's
homepage). See Figure 1 for a schematic representation of Kellar's classification (Source: [17]).

Our focus is on information-gathering, which we distinguish from fact-finding by using Kellar's
definition: Information-Gathering consists of tasks in which a user is collecting information, often
from multiple sources, in order to write a report, make a decision, or become more informed about a
particular topic [17, p. 67]. With this definition in hand let us try to characterize this class of tasks.

Figure 1: Kellar's classification of Web information. Source: Kellar, 2007.



2.1 The iterative nature of information-gathering tasks 

In their seminal work Belkin, Oddy and Brooks [3, 4] pointed out that assuming that a user with a
need for information will be able to specify that need exactly is a mistake. Instead, they pointed out 
that a user comes to an information retrieval system with an anomalous state of knowledge, which she 
then attempts to express as a query. Based on the information received the anomaly is rectified to 
some extent, the user's image of the world is altered somewhat. But this is not the end of the process. 
The altered state of the user's knowledge generates new anomalies which she then takes back to the 
information retrieval engine in the form of requests or queries. This iterative process continues till the 
user is satisfied with the extent of the change in her state of knowledge.

Belkin et al.'s work has been refined in several directions (we refer the reader to Marchionini's
book for a survey [24]) but the general view remains a powerful organizing principle for information
retrieval. This general view is borne out in the World Wide Web setting by Rose's conclusion [27]
- based on earlier studies he and Levinson conducted [28] - that information search is an iterative
process. Other supporting evidence in this regard is Aula et al.'s finding that experienced Web users
showed a pattern of searching, then browsing, then searching again [2], which was similar to Cothey's
finding in a longitudinal study that followed subjects evolving from novice to expert [8].

With all of this as background we characterize the information gathering task on the World Wide 
Web as an iterative one (see Figure 2). This task is initiated by an understanding that the user's
knowledge needs to be augmented, proceeds by looking for Web pages (either from search engines or 
other sources) and then browsing them when they are found. The process of browsing rectifies the 
knowledge anomaly partially or wholly and has the additional effect of providing links to other pages 
which might aid the process. The user may then choose to follow those other links (immediately or 
later) or finish browsing the current page. When a page is browsed completely the user is either satisfied 
entirely and terminates the task, or is partially satisfied and resumes the task by again looking for Web 
pages. 

This iterative characterization of information-gathering tasks suggests ways in which search
engines could aid users in performing them. It is to this that we now turn our attention.

2.2 Enabling information-gathering tasks

In Figure [2] we notice that a user browsing a Web page in the process of gathering information treats the 
page either as a source of information or as a source of links to other pages. It is therefore appropriate 
to provide users with links to two kinds of pages: 

[Figure 2 flowchart. Node labels: "I am not satisfied with the concepts I currently possess."; "I look for pages that may help me augment my knowledge in a certain domain"; "I browse a Web page"; "This link looks promising"; "That page was not much use"; "I learned something but not enough"; "I am satisfied ... for now".]

Figure 2: A schematic of information-gathering on the Web.

P1. Pages which contain information similar to the current page.

P2. Pages which provide links similar to the links provided by the current page.

Additionally it is our contention that as user experience with the Web improves, there will be 
the realization that people who create Web content and Web links have an understanding of the 
interrelationships between various pages. And so we suggest that a third kind of page could be useful 
in the information-gathering process:

P3. Pages which can be reached by following a sequence of links from the current page and pages 
from which the current page can be reached by following a sequence of links. 

The specifications P1 and P2 are general and open to interpretation as to how they might be
satisfied. The pages of P3 are more specific but there is still the question of which of these pages
to present to the user from among the several thousand that might satisfy this criterion. In
Section 4 we will provide three scoring mechanisms: FactRel will account for pages of type P1, SeekRel
for P2 and SurfRel for P3.

How should these links be provided to the user? While this is a question better addressed by experts 
in human-computer interaction, we would suggest that the "toolbars" provided by many search engines 
could be augmented to provide page recommendations. These already provide information about the 
"rank" of the page currently being viewed and various other pieces of information from the search 
engine's bag, and this additional use can fit in seamlessly. For searches that take place on a search 
engine's Web page, each "similar pages" link can contain these related pages. In either case it is of 
paramount importance that the links provided be appropriately categorized so that the user can choose 
to follow them (or not) depending on the particular function the page being viewed (or the initial search 
result) plays in her information-gathering process. 

3 Related work 

The relationship between Web pages has been studied fairly extensively. Most researchers use the term 
"similarity" and define it variously. In this paper we have consciously avoided the use of this term 
since we believe that the relationship between Web pages is not intrinsic to the pages but depends on 
the functionality of the pages for the user. However, in order to survey previous contributions to this 
area, in this section we will use the term "similarity." 

There are two broad categories of approaches to the Web page similarity problem. The first relies 
mainly on the textual content of Web pages. The pair of pages being compared is either seen just 
as two groups of items (keywords, anchor text, patterns in the text) which overlap significantly or as 
structured entities that resemble each other in their organization e.g. which tags appear next to which 
ones. Since this is not our approach we do not review this vast literature here, referring the reader 
to [30] for a succinct survey. 

The second approach involves taking link and interconnection information into account. An important
way in which interconnection has been used is to ascribe authority to a page based on which
pages link to it. This idea forms the basis of the PageRank algorithm employed by Google [5]. In a
similar vein, Kleinberg [18] described two attributes of Web pages: they can be authorities on a topic,
or they can be hubs, linking to pages which are authoritative. He described an iterative algorithm to
compute measures of these two attributes for each Web page. One line along which Kleinberg's work
was developed involved using anchor text as a descriptive summary of the page being linked to [7, 13].
But more relevant to our methods is the focus on link structure seen in Dean and Henzinger's
paper [10], where they gave two algorithms for finding similar Web pages. One of their algorithms used
the idea of co-citation seen earlier in Pitkow and Pirolli's work [26]. The second algorithm, called
Companion, shares an important feature with our scoring mechanisms: it uses Kleinberg's algorithm. Their
method of building a focused graph extends Kleinberg's ideas. We have used their ideas for building
capacitated subnetworks in our algorithm. While Dean and Henzinger give a list of pages similar to
a given page, Huang et al. [H] gave measures of similarity based on the predecessor and successor
sets, and also on the basis of all the vertices reachable from the two pages. Another sort of "closure"
of co-citation was used to define a similarity measure called PageSim in [21]. In this measure, pages
propagate their similarity measure to their neighbors, the importance of a particular propagation being
decided by the PageRank of the page. This measure was an improvement on the SimRank measure
proposed in [16], which worked on the principle that two pages are similar if they are linked to by similar
pages. The SimRank measure was shown to be a specific instance of a general framework for computing
similarity between heterogeneous data objects by Xi et al. [31], who proposed the SimFusion algorithm.
In Section 4.3 we will compare the results PageSim and SimRank produce on a small toy example with
the results given by our scoring mechanisms.

In a different use of link structure related to our own, Lu et al. [23, 22] proposed that two pages be
considered similar if flow could be routed from one of them to the other. However, unlike our work,
their capacity assignments were not based on any notion of authority. To the best of our knowledge 
this is the only other mention of using flow to score similarity in the literature. 

As mentioned, the literature on finding similarity is vast and has seen contributions from many 
different areas. The papers we have discussed above are the ones whose techniques are closely related 
to our own. With these in view we now proceed to describe our scoring mechanisms.

4 Computing relationship scores 

In this section we describe our algorithms for scoring the relationships between a pair of pages u and
v. It is our contention that two pages should have a high FactRel score if it is possible to find paths
from multiple pages to both of them. We say that a page z witnesses the FactRel relationship between
u and v if it is possible to reach both u and v from z by following a series of links. For example, in
Figure 3 the page G witnesses FactRel between A and B. Similarly we say that a page z witnesses
the SeekRel relationship if it is possible to reach z by following a series of links from u and also from v.
For example, in Figure 3, C witnesses SeekRel between G and I. F is another such witness for these
two. Somewhat differently, the SurfRel relationship does not require explicit witnesses. We say that
the SurfRel relationship exists between u and v if it is possible to reach u from v by following a series
of links, or vice versa. In Figure 3 we see that H and B are related this way, while A and I are not
related by SurfRel.

Hence our scores are based on finding paths in the World Wide Web. But which paths should be 
given more weight than others in the scoring process? To answer this question we rely on Kleinberg's 
notion [18] of authorities (pages which have credible information) and hubs (pages which link to good 
authorities) in order to derive a relevant set of focused capacitated subnetworks from the structure of 
the World Wide Web. We then determine relationship scores by sending flow to the two pages from 
witnesses (for FactRel), from the two pages to witnesses (for SeekRel) or between the pages (for SurfRel)
in these capacitated networks. 

We will now discuss these methods in detail. In Section 4.1 we will describe how to build our set of
capacitated subnetworks using Kleinberg's method for assigning authority to Web pages. In Section 4.2
we describe how to find witnesses, route flow and compute scores. We illustrate the working of the
algorithm on a toy example in Section 4.3 and also present a comparison by scoring SimRank [16] and
PageSim [21] on the same example.

Figure 3: A simple hyperlinked network



4.1 Building capacitated subnetworks 

We begin by assuming that we have a database of significant keywords, call it $\mathcal{K}$. We also assume
that there is some function $\gamma : \mathcal{K} \to (0, 1)$ which assigns relative importance to these keywords. We
do not specify how to build $\mathcal{K}$, preferring to rely on the wealth of tools available for this purpose (see,
e.g. [32]). For each of these keywords we apply Kleinberg's [18] HITS algorithm to compute hub and
authority values for all the pages associated with that keyword. At the end of this preprocessing
step each page has hub and authority values computed for the subset of the database of keywords it
is associated with. When we are given a pair of pages to score we identify sets of significant keywords
$K_u$ and $K_v$ for u and v. We then merge these two sets to get the top k keywords for some tunable
parameter k. This set of chosen keywords we call K. Once we have this set of keywords we proceed
by making a set of networks $M = \{N_w \mid w \in K\}$. For each $N_w$ we first find the set of Web pages
$P_w$ which contain the keyword w.

Since we will be using Kleinberg's hub and authority values [18] to capacitate the network (in
Section 4.2) we grow $P_w$ into $G_w$ by first taking all the pages that link to the pages in $P_w$ and the pages that
are linked to by the pages in $P_w$ (as described in [18]). Then we add the refinements to this structure
proposed by Dean and Henzinger [10]. Only the main feature of these refinements has been mentioned
in Steps 2(c) and 2(d) of Figure 4. The reader is referred to [10] for further details.

Finally, for each network $N_w$ we run Kleinberg's algorithm for assigning hub and authority values
to each page and label a node z of the network with these values: $\mathrm{hub}_w(z)$ and $\mathrm{auth}_w(z)$. A directed
edge from node x to node y is assigned capacity $\mathrm{hub}_w(x)$.

A summary of the algorithm is in Figure 4. We postpone a discussion of the motivation for this
construction to Section 4.3.
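
To make the capacity assignment concrete, the following is a minimal sketch of the last two steps of the construction: running a HITS-style iteration on an already-built subnetwork and giving each edge the hub value of its originating page. The function names, the fixed iteration count and the toy edge list are our own illustrative assumptions; building the subnetwork itself (keyword lookup and the Dean-Henzinger refinements) is not shown.

```python
from math import sqrt

def hits(edges, iterations=50):
    """Return (hub, auth) dicts for a directed graph given as (x, y) pairs."""
    nodes = {n for e in edges for n in e}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority of y: sum of the hub values of pages linking to y.
        auth = {n: 0.0 for n in nodes}
        for x, y in edges:
            auth[y] += hub[x]
        norm = sqrt(sum(a * a for a in auth.values())) or 1.0
        auth = {n: a / norm for n, a in auth.items()}
        # Hub of x: sum of the authority values of pages x links to.
        new_hub = {n: 0.0 for n in nodes}
        for x, y in edges:
            new_hub[x] += auth[y]
        norm = sqrt(sum(h * h for h in new_hub.values())) or 1.0
        hub = {n: h / norm for n, h in new_hub.items()}
    return hub, auth

def capacitate(edges):
    """Give each edge (x, y) the capacity hub(x) of its originating page."""
    hub, _ = hits(edges)
    return {(x, y): hub[x] for x, y in edges}

# Hypothetical toy subnetwork: page "a" links to three pages, "b" to one.
toy_edges = [("a", "c"), ("a", "d"), ("a", "e"), ("b", "c")]
capacity = capacitate(toy_edges)
```

In this toy network every edge leaving "a" receives the same capacity, and that capacity exceeds the one on the single edge leaving "b", since "a" is the better hub.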

4.2 Using flows to compute relationship scores 

Finding witnesses. For both SeekRel and FactRel we have to find witnesses in each $N_w$. In Figure 5
we describe a simple algorithm that uses breadth-first search from both u and v up to d levels, for some
value of d, to return a sorted list, $S_w(u, v)$, of witnesses for SeekRel. Note that we do not just create a set
of witnesses, but actually make an ordered list of witnesses. The significance of this will become clear
shortly. In order to construct a list of witnesses for FactRel we simply reverse the direction of all the
edges of $N_w$ and execute exactly the same algorithm. We denote the set of witnesses for FactRel by
$F_w(u, v)$. Note that the reversal of edges is only to find witnesses, not to compute flows; flow computation
takes place on the same graph for both SeekRel and FactRel.

Algorithm constructSubnetworks(k)

1. Identify keywords $K_u$ and $K_v$ for which u and v respectively have high hub and
authority values. Pick the top k elements of $K_u \cup K_v$. Call this set K.

2. For each $w \in K$ build $N_w$ as follows

(a) Take the set of pages $P_w$ containing keyword w.

(b) Grow $P_w$ into $G_w$ by taking pages that link to $P_w$ and pages that are linked
from $P_w$.

(c) Augment $G_w$ by including pages that share an outlink with the pages in $P_w$.

(d) Augment $G_w$ by adding other pages linked to by pages that link to the pages
of $P_w$.

(e) $N_w = (G_w, E_w)$ where $E_w$ is the set of (directed) edges connecting pages in
$G_w$.

(f) Run Kleinberg's algorithm [18] to assign hub and authority values to each node
$x \in G_w$, $\mathrm{hub}_w(x)$ and $\mathrm{auth}_w(x)$ respectively.

(g) Assign (directed) edge (x, y) (where x is the origin of the edge and y is the
endpoint) capacity $c_w(x, y) = \mathrm{hub}_w(x)$.

Figure 4: Building a set of capacitated subnetworks

Algorithm makeSeekWitnessList($N_w$, d, u, v)

1. Perform a BFS for d levels starting from u. Put all vertices encountered in $S_d(u)$.

2. Similarly construct $S_d(v)$ by performing a BFS for d levels from v.

3. Let $S_w(u, v) \leftarrow S_d(u) \cap S_d(v)$.

4. For each $x \in S_w(u, v)$ compute $\min\{h(v, x), h(u, x)\}$ where h(x, y) is the number of
hops in a directed path from x to y.

5. Return $S_w(u, v)$ sorted in ascending order of $\min\{h(v, x), h(u, x)\}$, breaking ties
between two witnesses x and y by comparing the value of $\max\{h(v, x), h(u, x)\}$ with
$\max\{h(v, y), h(u, y)\}$, and further breaking ties arbitrarily if they are the same.

Figure 5: Finding SeekRel witnesses for u and v.
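
The witness search can be sketched in a few lines. This is an illustrative sketch under our own assumptions rather than a reference implementation: the graph is a plain adjacency dictionary, hop counts are taken to be BFS (shortest-path) distances, u and v are excluded from the witness list, and remaining ties are broken by node name only to make the output deterministic.

```python
from collections import deque

def bfs_hops(adj, source, depth):
    """Hop counts from source to every vertex reachable within `depth` levels."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        if hops[x] == depth:
            continue  # do not expand beyond the depth limit
        for y in adj.get(x, ()):
            if y not in hops:
                hops[y] = hops[x] + 1
                queue.append(y)
    return hops

def make_seek_witness_list(adj, d, u, v):
    """Witnesses reachable from both u and v, nearest first."""
    hu = bfs_hops(adj, u, d)
    hv = bfs_hops(adj, v, d)
    # Assumption: the two pages being scored are not their own witnesses.
    common = (set(hu) & set(hv)) - {u, v}
    # Sort by min hop count, then max hop count, then name (arbitrary tie-break).
    return sorted(common, key=lambda x: (min(hu[x], hv[x]), max(hu[x], hv[x]), str(x)))

# Hypothetical toy graph: "b" is one hop from both pages, "c" is two hops away.
adj = {"u": ["a", "b"], "v": ["b"], "a": ["c"], "b": ["c"]}
witnesses = make_seek_witness_list(adj, 2, "u", "v")
```

For FactRel the same function can be called on an adjacency dictionary with every edge reversed.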



Computing flows. Having constructed the ordered list of witnesses we go down the list one vertex
at a time, using any standard single-source maximum flow algorithm to compute the max flow first
between u and the witness (to or from the witness, as required), then between v and the witness. Note
that when computing the flow from
u to a witness we eliminate v from the network and vice versa. This is to ensure that one page
doesn't piggyback on the other in order to route flow to a witness, i.e. all the flow v sends to a witness
is independent of u and vice versa. After computing the flow for a particular witness, we reduce the
capacity of the edges into or out of the witness. We postpone discussing the rationale and method
for this to the end of this section. We then move on to the next witness in the list. See Figure 6 for
a formal description of the algorithm for computing flows for SeekRel. The algorithm for computing
flows for FactRel is symmetric. In this case we denote the flow from a witness x as $\mathrm{factFlow}_w(x)$.

Algorithm flowSeek($N_w$, d, u, v)

1. $L \leftarrow$ makeSeekWitnessList($N_w$, d, u, v).

2. While $L \neq \emptyset$

(a) Extract the first element x from L.

(b) $\mathrm{flow}_w(u, x) \leftarrow$ max flow from u to x in $N_w \setminus \{v\}$.

(c) $\mathrm{flow}_w(v, x) \leftarrow$ max flow from v to x in $N_w \setminus \{u\}$.

(d) $\mathrm{seekFlow}_w(x) \leftarrow \min\{\mathrm{flow}_w(u, x), \mathrm{flow}_w(v, x)\}$.

(e) Call reduceSeekCapacity(x).

(f) $L \leftarrow L \setminus \{x\}$.

Figure 6: Finding the witness flow for SeekRel
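
As an illustration of the flow steps for a single witness, the sketch below computes the two max flows with the other page removed and takes their minimum. The choice of Edmonds-Karp is ours (any standard max flow algorithm would do), the function names and the toy capacities are hypothetical, and the capacity reduction step is deliberately left out.

```python
from collections import deque

def max_flow(capacity, source, sink, removed=()):
    """Edmonds-Karp max flow on {(x, y): cap}, ignoring nodes in `removed`."""
    residual = {}
    for (x, y), c in capacity.items():
        if x in removed or y in removed:
            continue
        residual[(x, y)] = residual.get((x, y), 0) + c
        residual.setdefault((y, x), 0)
    adj = {}
    for x, y in residual:
        adj.setdefault(x, []).append(y)
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            x = queue.popleft()
            for y in adj.get(x, ()):
                if y not in parent and residual[(x, y)] > 0:
                    parent[y] = x
                    queue.append(y)
        if sink not in parent:
            return total
        # Walk back from the sink, find the bottleneck, then push flow.
        path, y = [], sink
        while parent[y] is not None:
            path.append((parent[y], y))
            y = parent[y]
        push = min(residual[e] for e in path)
        for x, y in path:
            residual[(x, y)] -= push
            residual[(y, x)] += push
        total += push

def seek_flow(capacity, u, v, x):
    """Flow witnessed at x: each page's flow is computed with the other removed."""
    return min(max_flow(capacity, u, x, removed={v}),
               max_flow(capacity, v, x, removed={u}))

# Hypothetical capacities: u can reach x directly through a, or by going through v.
toy = {("u", "a"): 3, ("a", "x"): 2, ("v", "x"): 5, ("u", "v"): 4}
witnessed = seek_flow(toy, "u", "v", "x")
```

With v left in the network, u could route 6 units to x, 4 of them through v. Removing v leaves only the 2 units that u sends independently, which is the piggybacking the elimination step is meant to prevent.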

The scores. In order to compute the relationship scores we have to be able to combine flows from
different subnetworks. To do this we normalize all flows by dividing by the weight of the
maximum edge in the network, a quantity we denote by maxwt(w). Also, we factor in the relative
importance of the various keywords in K using the function $\gamma$. Hence our first two scores are

$$\mathrm{SeekRel}(u, v) = \sum_{w \in K} \gamma(w) \cdot \frac{1}{\mathrm{maxwt}(w)} \sum_{x \in S_w(u, v)} \mathrm{seekFlow}_w(x),$$

$$\mathrm{FactRel}(u, v) = \sum_{w \in K} \gamma(w) \cdot \frac{1}{\mathrm{maxwt}(w)} \sum_{x \in F_w(u, v)} \mathrm{factFlow}_w(x).$$


The rationale behind FactRel is that pages providing similar information will be identified by witnesses 
in the network, which act collaboratively to identify good sources of information and link to them. 
Similarly, the intuition behind SeekRel is that two pages which allow a user to reach more or less the
same pages are related in terms of their ability to guide the user as she navigates the Web. 
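
The way per-keyword flows are combined can be illustrated with a small numerical sketch. It assumes that, for each keyword subnetwork, the witnessed flows are summed, divided by maxwt(w) and weighted by the keyword importance before being added across keywords. The dictionary layout and all the numbers are hypothetical.

```python
def relationship_score(subnetworks):
    """Sum over keywords of gamma(w) / maxwt(w) times the total witnessed flow."""
    score = 0.0
    for net in subnetworks:
        score += net["gamma"] * sum(net["witness_flows"]) / net["maxwt"]
    return score

# Hypothetical values for two keyword subnetworks.
example = [
    {"gamma": 0.5, "maxwt": 10.0, "witness_flows": [4.0, 6.0]},
    {"gamma": 0.25, "maxwt": 8.0, "witness_flows": [8.0]},
]
score = relationship_score(example)
```

Here the first keyword contributes 0.5 and the second 0.25. Normalizing by maxwt(w) keeps a subnetwork with uniformly large hub values from dominating the total.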

The third measure, SurfRel, is easier to compute since it requires no witnesses. We simply compute
the max flow from u to v in $N_w$, denoting it $\mathrm{flow}_w(u \to v)$, and the flow from v to u, denoted
$\mathrm{flow}_w(v \to u)$. Now, we can say that

$$\mathrm{SurfRel}(u \to v) = \sum_{w \in K} \gamma(w) \cdot \frac{1}{\mathrm{maxwt}(w)} \cdot \mathrm{flow}_w(u \to v).$$


SurfRel encapsulates the idea that if the Web allows one page to reach another through a simple 
sequence of clicks, these two pages must be related because they are both likely to be visited in a 
single browsing session. This idea of two pages being related by the presence of a path between them 
is very intuitive and the concept of flow generalizes the idea of paths. By capacitating edges with 
the hub value of the nodes they originate at, we differentiate between nodes and their ability to allow 
information to propagate by looking at their credibility as hubs in the hyperlinked environment. The
greater the credibility of a node as a hub, the more flow it can forward.

Reducing witness capacity. In the case of SeekRel and FactRel, the presence of many witnesses that 
can sink or source a lot of flow from the two pages being scored (or send a lot of flow to them) leads 
to a higher score for the pair. But there are cases where this score may be artificially high. Consider
the network in Figure 7. E, B, C and G all witness SeekRel for H and I. But the flow to B, C and
G all goes through E. So these three are redundant, in the sense that the information they provide is
already contained in the fact that E is a witness for H and I.




Figure 7: Redundant witnesses. Capacities are marked on edges. 



It is to prevent these redundant witnesses from artificially inflating the relationship score that we
reduce the capacity associated with the witness in Step 2(e) of the flow computing algorithm of Figure 6.
For SeekRel, when we are done computing flow to a witness we reduce its incoming capacity before
moving on to the next witness in the list. For FactRel the outgoing capacity is reduced. Before we
describe the algorithm formally in Figure 8 let us define some notation. For a vertex x let the set of
incoming edges be I(x) and the set of outgoing edges be O(x). Let the flow routed for vertex u on
edge e be $f_u(e)$. The capacity of edge e is c(e).



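Algorithm reduceSeekCapacity(x)

if $\mathrm{flow}_w(u, x) < \mathrm{flow}_w(v, x)$
    for each $e \in I(x)$
        $c(e) \leftarrow \max\{c(e) - f_u(e), 0\}$.
    for each $e \in I(x)$
        $c(e) \leftarrow \max\{c(e) - f_v(e) \cdot \frac{\mathrm{flow}_w(u, x)}{\mathrm{flow}_w(v, x)}, 0\}$.
else
    for each $e \in I(x)$
        $c(e) \leftarrow \max\{c(e) - f_v(e), 0\}$.
    for each $e \in I(x)$
        $c(e) \leftarrow \max\{c(e) - f_u(e) \cdot \frac{\mathrm{flow}_w(v, x)}{\mathrm{flow}_w(u, x)}, 0\}$.

Figure 8: Reducing capacities for SeekRel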



Essentially what reduceSeekCapacity(x) does is remove the amount of flow witnessed at x. Since
we take the minimum of $\mathrm{flow}_w(u, x)$ and $\mathrm{flow}_w(v, x)$ as the amount of flow being witnessed, we remove
this amount from the incoming capacity of x. And to ensure we do this fairly for both u and v,
we penalize the incoming edges used by both the flows $\mathrm{flow}_w(u, x)$ and $\mathrm{flow}_w(v, x)$ equally by scaling
down the larger flow to the smaller one before subtracting it from the capacity of the incoming edge.
The algorithm reduceFlowCapacity(x) is symmetric to this, only it removes capacity from the outgoing
edges of the witness. In both these cases, with the witness capacity reduced, the ability of redundant
witnesses to skew the flow is diminished.
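
The capacity reduction for SeekRel can be sketched as a single function over the witness's incoming edges. The argument layout (dictionaries keyed by edge, with the per-edge flows of u and v passed in separately) and the guard for the case where both flows are zero are our own assumptions for illustration.

```python
def reduce_seek_capacity(capacity, incoming, f_u, f_v, flow_u, flow_v):
    """Subtract the witnessed flow from the witness's incoming edges, in place.

    capacity: {edge: capacity}; incoming: the edges into the witness;
    f_u, f_v: per-edge flow routed for u and for v; flow_u, flow_v: total flows.
    """
    if flow_u < flow_v:
        small, large, scale = f_u, f_v, flow_u / flow_v
    else:
        small, large = f_v, f_u
        # Assumption: when both flows are zero nothing is subtracted.
        scale = flow_v / flow_u if flow_u > 0 else 0.0
    for e in incoming:
        # The smaller flow is removed in full.
        capacity[e] = max(capacity[e] - small.get(e, 0), 0)
    for e in incoming:
        # The larger flow is scaled down to the smaller one before removal.
        capacity[e] = max(capacity[e] - large.get(e, 0) * scale, 0)
    return capacity

# Hypothetical witness x with two incoming edges of capacity 10 each:
# u routes 4 units on (a, x) and v routes 8 units on (b, x).
cap = {("a", "x"): 10, ("b", "x"): 10}
reduce_seek_capacity(cap, list(cap), {("a", "x"): 4}, {("b", "x"): 8}, 4, 8)
```

In this example 4 units are witnessed, so each page's edge loses 4 units of capacity: the edge used by u drops from 10 to 6, and the edge used by v also drops to 6 because its 8 units are scaled by 4/8.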

The reason for fixing an ordering for our witnesses becomes clear now, since the capacities of the
network decrease after each flow calculation. Clearly the order in which the witnesses are processed
will make a difference to the flow that is routed to them. Recall the function h(u, v) defined in Figure 5
as the number of hops in a directed path from u to v. Let us see what the values of (h(v, x), h(u, x))
are in Figure 7 for v = H and u = I. For G this is (3,2), for C it is (2,1), for B it is (1,2) and for E it is
(1,1). So the sorted order according to our algorithm should be E, B, C, G. For this order the total flow
witnessed will be 10 units for E plus 20 units for B = 30 units. Because of capacity reduction C and G
will not be able to witness any flow. But is this correct given that I has a direct edge to C which doesn't
go through B or E? We argue it is, since the flow of 30 that could be witnessed at C is the same as the
flow of 30 units that is being witnessed for H at B and E. If, on the other hand, we choose the
witnesses in the reverse order G, C, B, E the total flow is 30 witnessed by G + 30 witnessed by C +
30 witnessed by B + 10 witnessed by E, totalling 100, of which 70 units are redundant.

Thus it follows that distant witnesses along a chain of nodes are more likely to be redundant since 
decreasing their capacity will not affect the flow to nearer witnesses. With this in mind, and also noting 
that more information about a Web page is likely to be found in its near neighborhood rather than far 
away from it, we order witnesses in increasing order of the distance of the witness from one of the pages
(Step 5 of algorithm makeSeekWitnessList($N_w$, d, u, v) in Figure 5). If a distant witness still witnesses
flow after the reduction of capacity of nearer witnesses, we can be sure that this is not redundant flow.
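
The ordering rule can be checked directly on the hop counts quoted above for the four witnesses of H and I. The short sketch below only applies the sort key (minimum hop count, then maximum hop count); the variable names are ours.

```python
# Hop pairs (h(H, x), h(I, x)) as quoted in the text for each witness x.
hops = {"G": (3, 2), "C": (2, 1), "B": (1, 2), "E": (1, 1)}

def order_key(x):
    """Sort key: nearest distance first, then the farther distance."""
    a, b = hops[x]
    return (min(a, b), max(a, b))

ordered = sorted(hops, key=order_key)
tied = order_key("B") == order_key("C")
```

E always comes first and G last. B and C have the same key, (1, 2), so their relative order is one of the ties that the algorithm breaks arbitrarily, and the order E, B, C, G used in the discussion is one valid outcome.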

4.3 Discussion 

We took the simple subnetwork of Figure [9] and ran our scoring algorithms on it. The table of scores 
obtained is in Figure 10. For cleanness of presentation all hub values have been scaled by 1000. The
flow values have been scaled up by maxwt = 815 since we are only considering one subnetwork. 

Figure 9: A simple subnetwork with edge capacities according to the hub values of the originating
node.

In Figure 9 we see that node 2 is related by SeekRel to nodes 0 and 3. This is as it should be
since they both point to similar parts of the network. Node 0 sends 368 units of flow to the witness
5 which is matched by 2. Node 3 sends 368 units of flow to witness 6, which is also matched by 2.
And although node 1 shares many witnesses with 0, the flow it can send is limited by its outgoing
capacity (which is low because it is not a good hub) and so its SeekRel score is low, though non-zero,
and 2 and 3 beat it out in scoring.

[Figure 10 table not recoverable from the source. Surviving entries, node pairs unknown: (0,1183,0,0), (0,0,0,368), (0,0,0,253), (0,0,0,1183), (0,368,0,0), (0,1183,0,0). Legend: (SeekRel(x, y), FactRel(x, y), SurfRel(x → y), SurfRel(y → x)).]

Figure 10: Relationship scores for the network of Figure 9 (scaled up by 1000 x maxwt)

Figure 11: High scorers for Figure 9

If we look at the second column of the table in Figure 11 we see that 5 and 6 are the top scorers
for FactRel for several nodes. Node 2's high credibility as a hub helps draw attention to 5 and 6.
Node 0's good hub value helps relate 2 to 5, in what can be seen as an example of pure co-citation.

In the case of SurfRel, the example of 0 is interesting. While 0 is strongly related to 5 by this
measure, its score with 6 is relatively lower. In this case our score captures the fact that there are
multiple independent paths from 0 to 5 while all paths from 0 to 6 go through 2. The divergence in the
paths after 2 does not help boost SurfRel for 0 and 6.
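
The effect of independent paths versus a shared bottleneck can be sketched in Python using a plain maximum-flow computation (Edmonds-Karp) as a stand-in for the flow underlying SurfRel. This is not the SurfRel definition itself, and the network below is a hypothetical one with arbitrary capacities, not the subnetwork of Figure 9.

```python
from collections import defaultdict, deque


def max_flow(edges, source, sink):
    """Edmonds-Karp maximum flow; edges is a list of (u, v, capacity) triples."""
    if source == sink:
        return 0
    residual = defaultdict(lambda: defaultdict(int))
    for u, v, cap in edges:
        residual[u][v] += cap
        residual[v][u] += 0  # make sure the reverse edge exists
    total = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Find the bottleneck capacity along the path, then push flow along it.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        total += bottleneck


# Hypothetical network: "x" is reachable from "s" by two independent paths,
# while every path from "s" to "y" passes through the single node "m".
edges = [
    ("s", "a", 4), ("s", "b", 4),
    ("a", "x", 4), ("b", "x", 4),
    ("b", "m", 4), ("m", "y", 4), ("m", "z", 4),
]
print(max_flow(edges, "s", "x"), max_flow(edges, "s", "y"))
```

In this toy network the flow from "s" to "x" is twice the flow from "s" to "y", even though "m" fans out to several pages: the fan-out after the bottleneck adds no capacity on the way to "y".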

To further illustrate the power of our methods, we implemented the SimRank [16] and PageSim [21]
scoring algorithms and scored our simple subnetwork using them. The results are in Figures 12 and 13.

Figure 12: SimRank scores for Figure 9.
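
For readers unfamiliar with SimRank, the following Python sketch shows the standard iterative formulation, in which two distinct pages are similar to the extent that the pages linking to them are similar. It is not the implementation used for Figure 12; the decay factor of 0.8, the iteration count and the three-page graph are assumptions chosen for illustration.

```python
def simrank(in_neighbors, decay=0.8, iterations=10):
    """Naive iterative SimRank.

    in_neighbors maps every node to the list of nodes that link to it
    (nodes without in-links must be present with an empty list).
    """
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new_sim = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_sim[(a, b)] = 1.0
                    continue
                in_a, in_b = in_neighbors[a], in_neighbors[b]
                if not in_a or not in_b:
                    # A page with no in-links is similar to no other page.
                    new_sim[(a, b)] = 0.0
                    continue
                total = sum(sim[(x, y)] for x in in_a for y in in_b)
                new_sim[(a, b)] = decay * total / (len(in_a) * len(in_b))
        sim = new_sim
    return sim


# Hypothetical graph: page "c" links to both "a" and "b".
scores = simrank({"c": [], "a": ["c"], "b": ["c"]})
print(scores[("a", "b")], scores[("c", "a")])
```

Under this definition a page with no in-links receives a score of zero against every other page, whatever its out-links are.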



SimRank related none of the pages to either 0 or 1, whereas our SeekRel is able to detect the fact
that 0 can help the user find links to pages that 2 and 3 can also lead to. Even 1 shares
this property as a navigational aid with some of the other pages, a fact that comes up in our scoring.
The case of node 1 is particularly interesting because PageSim, which gives non-zero scores in many
cases where SimRank fails, does not deduce 1's relationship to 0 and 2 that our SeekRel is able to find.
A user currently viewing 1 would come to believe that only 3 and 4 are related to 1 if she relied on
PageSim or SimRank. This would be erroneous, because knowing that 0 is related to 1 in terms of the links
it provides could lead that user to 2, which she would not find if she relied on these other two measures.
PageSim is somewhat more sophisticated than SimRank, so it detects 0's relationship to 5, just as our
SurfRel does, but it misses 0's relationship to 3 that we find through SeekRel.
Figure 13: PageSim scores for Figure 9.



PageSim almost misses 5's relationship to 4 and also scores 5's relationship to 6 quite low. SimRank
completely misses the relationship to 4 and scores the relationship to 6 lower than the relationship to
2. On the other hand, a high FactRel score for both of these allows a user to tell that the information
available at 4 and at 6 is relevant to people who are interested in 5. Since our FactRel score between
5 and 2 is relatively low and our SurfRel score between them is high, a user can deduce the nature
of the relationship between 5 and 2, a fact also detected by SimRank.

We now move on to experiments on real data taken from the Web. 

5 Experimental evaluation 

5.1 Experimental setup 

We performed our experiments on four data sets taken from the Web. Creating these data sets was a
multi-stage process that began by querying AltaVista [1] with a search string and taking the top 100
results to form a core set. We did not use Google since we compare our results to Google's Similar
Pages feature. We then used the open source Web crawler Nutch [25] to retrieve the pages linked
from the core set. Then we found the top 1000 pages that link to these new pages using AltaVista's
advanced feature providing inlinks for a queried page. Finally, we found the inlinks of the pages in the
core using AltaVista and then went back to Nutch to find the outlinks of these pages. We followed Dean and
Henzinger [10] and took only the top 10 outlinks in the manner they specified, i.e. if we were looking
at the outlinks of a page u which pointed to a core page v, we took only the links on u which were
"around" the link to v, in the sense that we took the 5 links immediately preceding the link to v on
the page and the 5 links immediately following it. Having obtained this data set we preprocessed it by
computing the hub and authority values of all the pages in it.
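
The selection of outlinks "around" the link to v can be sketched as follows. This is a minimal Python illustration that assumes the outlinks of u are available as a list in page order; the function name `links_around` and the example URLs are hypothetical, and details such as duplicate links are not handled.

```python
def links_around(outlinks, target, radius=5):
    """Return up to `radius` links before and `radius` links after `target`.

    outlinks is the list of links on page u in the order they appear;
    target is the link to the core page v. The link to v itself is excluded.
    """
    if target not in outlinks:
        return []
    idx = outlinks.index(target)  # first occurrence of the link to v
    before = outlinks[max(0, idx - radius):idx]
    after = outlinks[idx + 1:idx + 1 + radius]
    return before + after


# Hypothetical page with 20 outlinks, where the link to v is the 11th.
page_links = [f"http://example.org/p{i}" for i in range(20)]
print(links_around(page_links, "http://example.org/p10"))
```

When the link to v is near the top or bottom of the page, fewer than 10 links are returned, since there are not 5 links on one side of it.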

Our four data sets were generated using the keyword strings "automobile" (54952 pages), "motor
company" (14973 pages), "clothes shopping" (37724 pages) and "guess" (12101 URLs). For repeatability
purposes, these data sets have been made available online.^1 We conducted extensive experiments on all
these data sets by taking one page out of them as a query, then scoring all three relationships for this
page with all the other pages in the data set. Note that this does not exactly correspond to our earlier
claim that we proceed by extracting keywords from pairs of pages and then computing flows on
the subnetworks obtained from those keywords. Limitations on the amount of data we were equipped
to handle in a university setting prevented us from performing these steps in full. We present
these stripped-down experiments as indicative of what a full implementation of our scoring mechanisms
might be able to achieve.

^1 http://www.cse.iitd.ernet.in/~bagchi/relationship-scores/

We compared our 10 top scoring pages for FactRel and SeekRel with the top 10 pages returned by
Google's Similar Pages feature. We also implemented the Companion algorithm described in [10] and
compared our results to the top 10 results returned by it. For SurfRel we simply took our top 10 results
and evaluated them. The evaluation in all these cases was done by conducting user surveys.

For each of the target URLs scored using FactRel we asked the user to imagine they had visited it
in the course of an information-gathering task and found it relevant. We then assembled a set of 30
URLs (FactRel's top 10, Google's top 10 and Companion's top 10). We presented these 30 URLs in a
random order and asked users to answer three yes/no questions: 1) Would you visit this page if you
had already visited the target page? 2) Does this page provide similar information to the target page?
and 3) Is this page relevant to your information-gathering task?

Each such survey was given to between 5 and 8 users. For each of the 30 pages, and each of these 
3 questions a relevance score was computed. For a given target URL t, and a result page i: 

\[ \mathrm{Rel}(t, i) = \frac{\text{Number of YES answers}}{\text{Number of users who took the survey}} \]



For each target page and each algorithm, the Precision at r of the ranked results was computed using 
the formula 

\[ \text{Precision-at-}r(t, r) = \frac{\sum_{i=1}^{r} \mathrm{Rel}(t, i)}{r} \]

The precision at r for an algorithm was computed by taking the average of the precision at r values 
over all the target pages evaluated. 
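
The relevance score and the precision at r can be computed as in the following Python sketch. The function names and the survey counts are made up for illustration and are not taken from our data.

```python
def relevance(yes_answers, num_users):
    """Rel(t, i): fraction of surveyed users who answered YES for result i."""
    if num_users <= 0:
        raise ValueError("num_users must be positive")
    return yes_answers / num_users


def precision_at_r(rel_scores, r):
    """Mean of Rel(t, i) over the top r ranked results of one target page."""
    if not 1 <= r <= len(rel_scores):
        raise ValueError("r must be between 1 and the number of results")
    return sum(rel_scores[:r]) / r


def mean_precision_at_r(per_target_scores, r):
    """Average precision at r over all evaluated target pages."""
    values = [precision_at_r(scores, r) for scores in per_target_scores]
    return sum(values) / len(values)


# Hypothetical survey: 5 users, YES counts for the top 4 results of one target page.
rels = [relevance(yes, 5) for yes in [5, 4, 0, 2]]
print(precision_at_r(rels, 2), precision_at_r(rels, 4))
```

Because each Rel(t, i) lies between 0 and 1, the precision at r also lies between 0 and 1 and can be compared directly across algorithms for the same value of r.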

For SeekRel the first and the third questions remained the same. The second question was replaced
with "Does this page provide links similar to those in the target page?" Precision at r was calculated
similarly for SeekRel. For SurfRel we asked only one question, "Is this page relevant to your task?", and
we did not present results from Google or Companion.

The code for all scoring mechanisms was written in Java (JDK-1.5.0). The open source Nutch
crawler was downloaded and run, and a parser was written in C++ to parse its output. A total of 9
target pages were evaluated for FactRel, 7 for SeekRel and 5 for SurfRel. On the 14973 page "motor
company" data set, finding all three relationship scores between a given query page and all the other
pages took about 8 minutes on average on a desktop PC with a 3.4 GHz Intel Pentium processor and
1 GB of RAM. On the 37724 page "clothes shopping" data set it took about 1
hour on average to calculate all three scores of a given page with all other pages. Let us now see what
the experiments revealed.



5.2 Experimental results 

In Figure 14 we list the 10 URLs that scored the highest on FactRel for the page www.honda.com. The
relationship revealed is expected: other major car companies. More interesting is the list of pages related
by FactRel to www.cngvehicle.com. Not only do we get pages related to other alternate fuels (Biodiesel
Forum (forums.biodieselnow.com), Electric Drive Transportation Association (evaa.org)) and government
agencies dealing with renewable energy policy, we also get links to private car industry players who are
pursuing the development of energy efficient cars.
Rank  www.honda.com              www.cngvehicle.com
1     www.ford.com               www.evaa.org
2     www.toyota.com             forums.biodieselnow.com
3     www.landrover.com          www.eere.energy.gov/cleancities
4     www.audi.com               www.ford.com
5     www.gm.com                 www.nrel.gov
6     www.mercuryvehicles.com    www.eere.energy.gov/cleancities/-
7     www.cadillac.com           www.gsa.gov/Portal/gsa/ep/-
8     www.chevrolet.com          www.mercuryvehicles.com
9     www.lincoln.com            www.gm.com
10    www.porsche.com            www.honda.com

Figure 14: FactRel top scorers for two pages of the "motor company" data set.
Rank  FactRel                    Google's "similar pages"
1-3   (rows only partially recovered: www.neimanmarcus.com, www.nordstrom.com)
4     www.abercrombie.com        www.neimanmarcus.com
5     www.bananarepublic.com     www.barneys.com
6     www.bluefly.com            www.fds.com
7     www.spiegel.com            www.starbucks.com
8     www.saksfifthavenue.com    www.walmart.com
9     www.target.com             www.nycvisit.com
10    www.eddiebauer.com         www.ritzcarlton.com

Figure 15: FactRel vs Google's "similar pages" for www.bloomingdales.com



For the Web page of the clothing store Bloomingdale's^2, FactRel and Google's "similar pages"
returned more or less identical lists right at the top, but FactRel's high scorers remained very focused
(other apparel stores) while Google provided links to coffee shops and hotels, which could possibly be
appropriate in some contexts but deviate from what is arguably the main focus of a user visiting the
Bloomingdale's Web page (see Figure 15).

As another demonstration of FactRel's reliability in providing alternate sources of information, we
present its top scorers for www.mysimon.com, a comparison shopping site, in Figure 16. FactRel scored
Web pages for major comparison shopping sites very high.

To test the robustness of FactRel we scored the home page of Guess Jeans (www.guess.com) using
not just the "clothes shopping" data set but also the "guess" data set. Despite the presence of the
ambiguous keyword "guess", FactRel's top 10 results were closely related to the original page: pages
relating to clothing and accessories. Google's similar pages, on the other hand, appeared to get severely
misled by the keyword "guess" (see Figure 17). The precision at r graph for www.guess.com in Figure 18
reveals that Google does very poorly while FactRel and Companion provide good results.

In general we found that FactRel's results were substantially better than Google's but not better
than those returned by Companion. In Figure 19 we see that users preferred to visit the top 10 pages
presented by FactRel over those of Google after having visited the target page.

For SeekRel the picture was more complex. While users generally felt that the pages presented by 
SeekRel were far better than the results presented by Google and Companion in terms of the similarity 

^2 www.bloomingdales.com
Rank  FactRel for www.mysimon.com
1     www.dealtime.com
2     shopping.yahoo.com
3     www.ebay.com
4     www.bizrate.com
5     www.pricegrabber.com
6     www.nextag.com
7     www.become.com
8     www.alibris.com
9     www.buy.com
10    www.bestbuy.com

Figure 16: FactRel high scorers for www.mysimon.com





Rank  FactRel                    Google's "similar pages"
1     www.gap.com                www.guessthename.com
2     www.gucci.com              www.onlineshoes.com/...
3     www.marciano.com           www.sonypictures.com/...
4     www.guessinc.com           www.imdb.com/title/tt0372237
5     www.jcrew.com              www.amazon.com/Guess/...
6     www.gbyguess.com           www.bizrate.com/...guess+bags.html
7     www.hugoboss.com           www.learner.org/...
8     www.givenchy.com           popular.ebay.com/...Guess+Jeans.html
9     www.gianfrancoferre.com    www.answers.com/topic/guess-inc
10    www.diesel.com             www.guessfinancial.com

Figure 17: FactRel vs Google's "similar pages" for www.guess.com
Figure 18: Precision at r for the relevance question for www.guess.com.

Figure 19: Precision at r for the visit question for the FactRel target pages.

Figure 20: Precision at r for the similar links question for the SeekRel target pages.

Figure 21: Precision at r for the visit question for the SeekRel target pages.



www.shoppingcolumn.com/personal-shopping.html
www.misslist.com/stores/shoes.html
www.arlingtoncards.com/aroundtown/bizshopl.htm
www.allwi.com/wipresents.html
www.digital-librarian.com/shopping.html
www.cooll055.com/lc/features/shopping
www.ersys.com/usa/13/1349000/mall.htm

Figure 22: SeekRel top scorers for ersys.com's mall information page for Santa Clara, CA.



of the links on them to those on the target page (see Figure 20), they preferred to visit the pages
presented by the other two algorithms (see Figure 21). This is strong independent evidence in favour
of Aula et al.'s conclusion [2] that Web users prefer to browse rather than search. Rather than visit
another page with links similar to a given page they would rather visit a page with actual information
on it.

Another possible drawback in SeekRel was revealed when we scored for a page listing all the malls
in the Santa Clara, California area^3 (see Figure 22). We found that pages with links to online shopping
resources and even personal shopping options scored high. But we also found that pages with local
information for places as far afield from Santa Clara as Macon, Georgia appeared near the top of the
list. The absence of geographical domain knowledge in our system shows up here.

The user response to SurfRel was fairly good. Almost half of our top ten results were found relevant
by the respondents (see Figure 23).

^3 www.ersys.com/usa/06/0669084/mall.htm

Figure 23: Precision at r for SurfRel.

6 Discussion 

Our scoring mechanisms are designed with a view to integrating the two broad streams of thought on
Web page relationships: textual and link-based. This goes some way toward addressing Lawrence and
Giles' criticism that search engines are biased towards pages which are well-linked [19] and is hence
an advantage over algorithms like Dean and Henzinger's Companion which take only link information
into consideration.

Although users ranked our results in the same ballpark as Companion, it is our contention that
our algorithms are much less resource intensive and much more suited to inclusion in a real-world
search engine. We maintain a deck of subnetworks, one corresponding to each significant keyword. It is
difficult to estimate the number of significant keywords but if we take the widely used lexical database
WordNet [11] as an indicator, the number is of the order of 100,000.^4 The Companion algorithm, on
the other hand, creates a subnetwork for each queried page. Creating the subnetwork at query time
is difficult because of the overhead involved in crawling the Web (or even an image of the Web stored
on disk), and preprocessing and storing these networks appears infeasible given that the size of the
World Wide Web is estimated to be in the tens of billions of pages [9]. Even if it were feasible to store
structural information on such a scale, the problem of keeping it up to date is hard to solve. The Web is
constantly changing, and updating our keyword-based subnetworks will be an order of magnitude less
resource intensive than updating billions of page-specific subnetworks.

Our approach is further vindicated by the observation that the results provided by FactRel and 
SeekRel clearly outperform Google's Similar Pages. We have tried to gracefully bring together textual 
and link information in a common framework where one can compensate for the shortcomings of the 
other. The Guess Jeans example presented above demonstrates that our scores can leverage link 
information to handle ambiguous keywords in a manner better than Google can. 

The main contribution of this paper, in our view, is that it locates the question of how to relate
pages in the context of user intent. As part of our future research agenda we want to formulate
relationships between pages that can service user intent outside the domain of information-gathering.
We also want to test the applicability of our methods in social networking situations and user-generated
content scenarios.

^4 WordNet 3.0 contains 155,287 distinct strings.